This is the dilemma of a reputed US airline carrier, 'Falcon Airlines'. The airline aims to determine the relative importance of each parameter with regard to its contribution to passenger satisfaction. Provided is a random sample of 90,917 individuals who travelled on their flights. The on-time performance of the flights, along with passenger information, is published in the CSV file named 'Flight data'. These passengers were asked to provide feedback at the end of their flights on various parameters along with their overall experience; the collected details are available in the survey report CSV labelled 'Survey data'.
In the survey, passengers were explicitly asked whether they were satisfied with their overall flight experience; this is captured in the survey report under the variable labelled 'Satisfaction'.
The problem consists of two separate datasets: Flight data and Survey data.
You are expected to treat both datasets as raw data and perform any necessary cleaning/validation steps as required.
Passenger dissatisfaction leads to declining passenger numbers, which affects the aviation business drastically. A good starting point is therefore to analyse which attributes significantly drive satisfaction, and to predict whether a passenger will be satisfied in the future.
The study has two aims: first, analysing the key factors driving customer satisfaction; second, building a predictive model that tells the business whether a passenger is likely to be satisfied with the services.
To remain competitive, it is essential that the airline caters to its passengers efficiently; failing to do so can hurt the company's profitability. This study aims to provide clear business insights into the attributes that drive passenger satisfaction, and to deliver a robust predictive model of whether a passenger is likely to be satisfied.
The raw data provided consists of two separate datasets: "Flight data", which contains the on-time performance of flights and passenger information, and "Survey data", which contains passengers' feedback, gathered at the end of their flights, on various parameters along with their overall experience.
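Since the analysis needs a single table, the two raw files must first be joined on a common passenger identifier. The snippet below is a minimal sketch of that step using small in-memory stand-ins; the real data would come from `pd.read_csv('Flight data.csv')` and `pd.read_csv('Survey data.csv')`, and `'CustomerId'` as the join key is an assumption.

```python
import pandas as pd

# Tiny stand-ins for the two raw files (illustration only).
flight = pd.DataFrame({'CustomerId': [101, 102, 103],
                       'Flight_Distance': [820, 1450, 300]})
survey = pd.DataFrame({'CustomerId': [101, 102, 104],
                       'Satisfaction': ['satisfied', 'neutral or dissatisfied', 'satisfied']})

# Validate the join key before merging: it should be unique in each file.
assert flight['CustomerId'].is_unique and survey['CustomerId'].is_unique

# Inner join keeps only passengers present in both files.
combined = flight.merge(survey, on='CustomerId', how='inner')
print(combined.shape)  # only CustomerIds 101 and 102 match -> (2, 3)
```

An inner join is the conservative choice here: rows without both flight and survey information carry no target label and cannot be used for modelling anyway.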
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np  # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from scipy.cluster.hierarchy import dendrogram, linkage,cophenet
from sklearn.cluster import AgglomerativeClustering
import warnings
from scipy.stats import zscore
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import metrics
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline
#libraries to help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier)
from xgboost import XGBClassifier
from sklearn.ensemble import BaggingClassifier
airdata=pd.read_csv('AirplaneData.csv')
#Making a Copy of the original data set:
data = airdata.copy()
len(data)
data.shape
Observations:
The dataset consists of 90,917 observations and 24 attributes.
data.dtypes
Observation
data.dtypes.value_counts() #Count the Data types
data.head()
Observation
data.info()
Observation
# find categorical variables
categorical = [var for var in data.columns if data[var].dtype=='O']
print('There are {} categorical variables\n'.format(len(categorical)))
print('The categorical variables are :', categorical)
# lets check duplicate observations
data.duplicated().sum()
Observation
# df.describe() gives us an understanding of the Central tendencies of Data
data.describe(include='all').T
Observation
print("The average Customer Age {:.4f} years, 50% of Customers are of {} Age or less, while the maximum Customer Age is {}.".format(data['Age'].mean(),data['Age'].quantile(0.50), data['Age'].max()))
# Quick way to separate numeric columns
data.describe().columns
data = data.rename(columns={'CustomerId': 'Customer_Id', 'CustomerType': 'Customer_Type','TypeTravel': 'Travel_Type',
'Class': 'Travel_Class','Departure.Arrival.time_convenient': 'Dep_Arriv_time_convenient',
'Leg_room_service': 'Legroom_service','Inflightwifi_service': 'Inflght_wifi_service',
'Inflight_entertainment': 'Inflght_entrtnmnt',
'Ease_of_Onlinebooking': 'Ease_of_Online_bkng','DepartureDelayin_Mins': 'DeprtDelayin_Mins',
'ArrivalDelayin_Mins': 'ArrivDelayin_Mins'})
data.head()
Observation
# Lets see unique values
colmns = data.columns
for col in colmns:
print('Unique Values of {} are \n'.format(col),data[col].unique())
print('*'*90)
Observation
#checking for missing values
data.isna().sum().sort_values(ascending = False)
Observation
percent = (data.isnull().sum()/len(data)).round(4)*100
print(percent)
Observation
total = data.isnull().sum().sort_values(ascending=False) # total number of null values
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False).round(4)*100
missing=pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing)
Observation
to_be_cat = ['Gender' , 'Customer_Type' , 'Travel_Class' , 'Travel_Type', 'Seat_comfort', 'Dep_Arriv_time_convenient',
'Food_drink', 'Gate_location', 'Inflght_wifi_service',
'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
'Onboard_service', 'Legroom_service', 'Baggage_handling',
'Checkin_service', 'Cleanliness', 'Online_boarding', 'Satisfaction']
for col in to_be_cat:
data[col] = data[col].astype('category')
data.info()
Observations
# While doing uni-variate analysis of numerical variables we want to study their central tendency
# and dispersion.
# Let us write a function that will help us create boxplot and histogram for any input numerical
# variable.
# This function takes the numerical column as the input and returns the boxplots
# and histograms for the variable.
# Let us see if this help us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(10, 8), bins=None, xlabelsize=12, ylabelsize=10):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (10, 8))
    bins: number of bins (default None / auto)
    """
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.distplot(feature, kde=False, ax=ax_hist2, bins=bins, color="darkgreen")
    else:
        sns.distplot(feature, kde=False, ax=ax_hist2, color="darkgreen")  # For histogram
ax_hist2.axvline(
np.mean(feature), color="purple", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
np.median(feature), color="black", linestyle="-"
) # Add median to the histogram
histogram_boxplot(data["Customer_Id"])
Observation
Observations on Age
histogram_boxplot(data["Age"])
sns.distplot(data['Age'], color='magenta', hist_kws={"color": "darkgreen"})
Observation
Observations on Flight Distance
histogram_boxplot(data["Flight_Distance"])
Observation
Observation on Departure Delay in Mins
histogram_boxplot(data["DeprtDelayin_Mins"])
Observation
Observations on Arrival Delay in Mins
histogram_boxplot(data["ArrivDelayin_Mins"])
Observations
# Function to create barplots that indicate percentage for each category.
def perc_on_bar(plot, feature):
    """
    plot: axes object returned by the countplot
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    """
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position for the annotation
        y = p.get_y() + p.get_height()  # height of the bar
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
    plt.show()  # show the plot
plt.figure(figsize=(6, 5))
ax = sns.countplot(data["Gender"], palette="plasma")
perc_on_bar(ax, data["Gender"])
Observation
plt.figure(figsize=(6, 5))
ax = sns.countplot(data["Customer_Type"], palette="plasma")
perc_on_bar(ax, data["Customer_Type"])
Observations
plt.figure(figsize=(6, 5))
ax = sns.countplot(data["Travel_Type"], palette="plasma")
perc_on_bar(ax, data["Travel_Type"])
Observation
plt.figure(figsize=(6, 5))
ax = sns.countplot(data["Travel_Class"], palette="plasma")
perc_on_bar(ax, data["Travel_Class"])
Observation
plt.figure(figsize=(9, 5))
ax = sns.countplot(data["Seat_comfort"], palette="plasma")
perc_on_bar(ax, data["Seat_comfort"])
Observation
plt.figure(figsize=(9, 5))
ax = sns.countplot(data["Dep_Arriv_time_convenient"], palette="plasma")
perc_on_bar(ax, data["Dep_Arriv_time_convenient"])
Observation
plt.figure(figsize=(9, 5))
ax = sns.countplot(data["Food_drink"], palette="plasma")
perc_on_bar(ax, data["Food_drink"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Gate_location"], palette="plasma")
perc_on_bar(ax, data["Gate_location"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Inflght_wifi_service"], palette="plasma")
perc_on_bar(ax, data["Inflght_wifi_service"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Inflght_entrtnmnt"], palette="plasma")
perc_on_bar(ax, data["Inflght_entrtnmnt"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Online_support"], palette="plasma")
perc_on_bar(ax, data["Online_support"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Ease_of_Online_bkng"], palette="plasma")
perc_on_bar(ax, data["Ease_of_Online_bkng"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Onboard_service"], palette="plasma")
perc_on_bar(ax, data["Onboard_service"])
# Prepare Data
df = data.groupby('Onboard_service').size()
sns.set(palette="Paired")
# Make the plot with pandas
df.plot(kind='pie', subplots=True, figsize=(5, 5))
plt.title("Pie Chart of On-Board Service")
plt.ylabel("")
plt.show()
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Legroom_service"], palette="plasma")
perc_on_bar(ax, data["Legroom_service"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Baggage_handling"], palette="plasma")
perc_on_bar(ax, data["Baggage_handling"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Checkin_service"], palette="plasma")
perc_on_bar(ax, data["Checkin_service"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Cleanliness"], palette="plasma")
perc_on_bar(ax, data["Cleanliness"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Online_boarding"], palette="plasma")
perc_on_bar(ax, data["Online_boarding"])
Observation
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Satisfaction"], palette="plasma")
perc_on_bar(ax, data["Satisfaction"])
freq_table = data["Satisfaction"].value_counts().to_frame()
freq_table.reset_index(inplace=True) # reset index
freq_table.columns = [ "Satisfaction" , "Cnt_Satisfaction"] # rename columns
freq_table["Pecentage"] = freq_table["Cnt_Satisfaction"] / sum(freq_table["Cnt_Satisfaction"])
freq_table
# Python pie chart with formatting
plt.figure(figsize=(5, 5))
colors = ['turquoise', 'lightcoral']
explode = (0.1, 0)  # explode 1st slice
# Plot
plt.pie(freq_table['Cnt_Satisfaction'],
        labels=freq_table['Satisfaction'],
        colors=colors,
        explode=explode,
        autopct='%1.1f%%',
        shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Observation
all_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15,7))
sns.heatmap(data[all_col].corr(),
annot=True,
linewidths=0.5,vmin=-1,vmax=1,
center=0,cmap='cividis',
cbar=True,)
plt.show()
Observation
sns.pairplot(data[['ArrivDelayin_Mins', 'DeprtDelayin_Mins']])
plt.show()
Observation
# Listing Categorical variables
categorical_cols=['Gender', 'Customer_Type', 'Travel_Type',
'Travel_Class', 'Seat_comfort', 'Dep_Arriv_time_convenient',
'Food_drink', 'Gate_location', 'Inflght_wifi_service',
'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
'Onboard_service', 'Legroom_service', 'Baggage_handling',
'Checkin_service', 'Cleanliness', 'Online_boarding']
## Function to plot stacked bar chart
def stacked_plot(x):
sns.set(palette="Set1")
tab1 = pd.crosstab(x, data["Satisfaction"], margins=True)
print(tab1)
print("-" * 120)
tab = pd.crosstab(x, data["Satisfaction"], normalize="index")
tab.plot(kind="bar", stacked=True, figsize=(8, 5))
# plt.legend(loc='lower left', frameon=False)
# plt.legend(loc="upper left", bbox_to_anchor=(0,1))
plt.show()
stacked_plot(data["Gender"])
Observation
stacked_plot(data["Seat_comfort"])
Observation
stacked_plot(data["Food_drink"])
Observation
stacked_plot(data["Gate_location"])
Observation
stacked_plot(data["Inflght_wifi_service"])
Observation
stacked_plot(data["Inflght_entrtnmnt"])
Observations
stacked_plot(data["Online_support"])
Observation
stacked_plot(data["Ease_of_Online_bkng"])
Observation
stacked_plot(data["Legroom_service"])
Observation
stacked_plot(data["Checkin_service"])
Observation
plt.figure(figsize=(10, 5))
sns.boxplot(x='Travel_Class',y='Age',data=data,hue='Gender',palette="RdYlGn")
Observation
sns.boxplot(x='Customer_Type',y='Age',data=data,hue='Gender')
Observation
plt.figure(figsize=(10, 5))
sns.boxplot(x='Customer_Type',y='Age',data=data,hue='Satisfaction',palette="bright")
Observation
plt.figure(figsize=(10, 5))
sns.boxplot(x='Inflght_wifi_service',y='Flight_Distance',data=data,hue='Satisfaction',palette="bright")
Observation
f,ax=plt.subplots(1,2,figsize=(12,7))
sns.set(palette="Paired")
data['Satisfaction'][data['Gender']=='Male'].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',ax=ax[0],shadow=True)
data['Satisfaction'][data['Gender']=='Female'].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',ax=ax[1],shadow=True)
ax[0].set_title('Satisfied (Male)')
ax[1].set_title('Satisfied (Female)')
plt.show()
Observation
pd.pivot_table(data, index='Satisfaction', values=['Age', 'Flight_Distance', 'DeprtDelayin_Mins', 'ArrivDelayin_Mins'])
Observation
data1=data.copy()
replaceStruct = {
"Seat_comfort": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Dep_Arriv_time_convenient": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Food_drink": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Gate_location": {"very inconvinient": 0, "Inconvinient": 1 ,"need improvement": 2 ,"manageable":3, "Convinient":4 , "very convinient": 5},
"Inflght_wifi_service": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Inflght_entrtnmnt": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Online_support": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Ease_of_Online_bkng": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Onboard_service": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Legroom_service": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Baggage_handling": {"poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Checkin_service": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Cleanliness": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Online_boarding": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
"Satisfaction": {"satisfied": 1, "neutral or dissatisfied": 0 }
}
data1=data1.replace(replaceStruct)
data1.head()
data.drop('Customer_Id', axis=1, inplace=True)
data.drop('Customer_Type', axis=1, inplace=True)
data1.drop('Customer_Id', axis=1, inplace=True)
data1.drop('Customer_Type', axis=1, inplace=True)
data.shape #Look at the shape of dataset
data1.shape
Observation
data1.sample(5)
Observation
# number of missing values (only the ones recognised as missing values) in each of the attributes
pd.DataFrame( data1.isnull().sum(), columns= ['Number of missing values'])
Observation
data1.isnull().sum().sum() # Total number of recognised missing values in the entire dataframe
Observation
# most rows don't have missing values now
num_missing = data1.isnull().sum(axis=1)
num_missing.value_counts()
data1[num_missing == 2].head(5)
data1['Travel_Type'] = data1['Travel_Type'].astype(str).replace('nan', 'is_missing').astype('category')
data1['Travel_Type'].value_counts()
data1.loc[data1.Travel_Class == "Eco", "Travel_Type"] = "Personal Travel"
data1['Travel_Type'].value_counts()
data1.loc[data1.Travel_Class == "Eco Plus", "Travel_Type"] = "Personal Travel"
data1['Travel_Type'].value_counts()
data1.loc[data1.Travel_Class == "Business", "Travel_Type"] = "Business travel"
data1['Travel_Type'].value_counts()
#Removing "is_missing" category from Travel_Type
data1['Travel_Type'] = data1['Travel_Type'].cat.remove_categories(['is_missing'])
#Checking if the Category is removed
data1['Travel_Type'].value_counts()
#checking for missing values
data1.isna().sum().sort_values(ascending = False)
Observation
# outlier detection using boxplot
numerical_col = ['Flight_Distance' ]
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
plt.subplot(5,4,i+1)
plt.boxplot(data1[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
def treat_outliers(data1, col):
    '''
    treats outliers in a variable
    data1: data frame
    col: str, name of the numerical column
    '''
    Q1 = data1[col].quantile(0.25)  # 25th percentile
    Q3 = data1[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # values smaller than Lower_Whisker are set to Lower_Whisker,
    # and values above Upper_Whisker are set to Upper_Whisker
    data1[col] = np.clip(data1[col], Lower_Whisker, Upper_Whisker)
    return data1
def treat_outliers_all(data1, col_list):
    '''
    treat outliers in all numerical variables
    data1: data frame
    col_list: list of numerical columns
    '''
    for c in col_list:
        data1 = treat_outliers(data1, c)
    return data1
numerical_col2 =['Flight_Distance']
data1 = treat_outliers_all(data1,numerical_col2)
# Looking at the Boxplot for after treating Outliers
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col2):
plt.subplot(5,4,i+1)
plt.boxplot(data1[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Observation
resp = data1.DeprtDelayin_Mins
from scipy.stats import shapiro
shapiro(resp)
Observation
num_feats=data1.dtypes[data1.dtypes!='object'].index
#Calculate Skew and Sort
skew_feats=data1[num_feats].skew().sort_values(ascending=False)
skewness=pd.DataFrame({'Skew' : skew_feats})
skewness
Observation
We see that the Columns "DeprtDelayin_Mins" and "ArrivDelayin_Mins" are skewed.
# lets plot histogram for Transformed Columns:
from scipy.stats import norm
all_col = ['DeprtDelayin_Mins', 'ArrivDelayin_Mins']
plt.figure(figsize=(15,65))
for i in range(len(all_col)):
plt.subplot(18,3,i+1)
plt.hist(data1[all_col[i]])
#sns.displot(data1[all_col[i]])
plt.tight_layout()
plt.title(all_col[i],fontsize=25)
plt.show()
for colname in all_col:
data1[colname + '_log'] = np.log(data1[colname]+1)
data1.drop(all_col, axis=1, inplace=True)
data1.columns
# lets plot histogram for Transformed Columns:
from scipy.stats import norm
all_col = ['DeprtDelayin_Mins_log', 'ArrivDelayin_Mins_log']
plt.figure(figsize=(15,65))
for i in range(len(all_col)):
plt.subplot(18,3,i+1)
plt.hist(data1[all_col[i]])
#sns.displot(df[all_col[i]], kde=True)
plt.tight_layout()
plt.title(all_col[i],fontsize=25)
plt.show()
Observation
from pandas_profiling import ProfileReport
profile = ProfileReport(data1, minimal=True, title="Pandas Profiling Report")
profile.to_notebook_iframe()
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
gender = {'Female':1, 'Male':2}
data1['Gender']=data1['Gender'].map(gender).astype('Int32')
travel_type = {'Personal Travel':1,'Business travel':2}
data1['Travel_Type']=data1['Travel_Type'].map(travel_type).astype('Int32')
travel_class = {'Business':1,'Eco':2, 'Eco Plus':3}
data1['Travel_Class']=data1['Travel_Class'].map(travel_class).astype('Int32')
X = data1
Y = data1["Satisfaction"]
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
#Fit and transform the train data
X_train=pd.DataFrame(imputer.fit_transform(X_train),columns=X_train.columns)
#Transform the test data
X_test=pd.DataFrame(imputer.transform(X_test),columns=X_test.columns)
#Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
Observation
## Function to inverse the encoding
def inverse_mapping(x,y):
inv_dict = {v: k for k, v in x.items()}
X_train[y] = np.round(X_train[y]).map(inv_dict).astype('category')
X_test[y] = np.round(X_test[y]).map(inv_dict).astype('category')
inverse_mapping(gender,'Gender')
inverse_mapping(travel_type,'Travel_Type')
inverse_mapping(travel_class,'Travel_Class')
cols = X_train.select_dtypes(include=['object','category'])
for i in cols.columns:
print(X_train[i].value_counts())
print('*'*30)
cols = X_test.select_dtypes(include=['object','category'])
for i in cols.columns:
print(X_test[i].value_counts())
print('*'*30)
#Converting Float to Int:
to_be_int = ['Age',
'Seat_comfort', 'Dep_Arriv_time_convenient', 'Food_drink',
'Gate_location', 'Inflght_wifi_service', 'Inflght_entrtnmnt',
'Online_support', 'Ease_of_Online_bkng', 'Onboard_service',
'Legroom_service', 'Baggage_handling', 'Checkin_service', 'Cleanliness',
'Online_boarding', 'Satisfaction']
for col in to_be_int:
X_train[col] = X_train[col].astype('int64')
#Converting Float to Int:
to_be_int = ['Age',
'Seat_comfort', 'Dep_Arriv_time_convenient', 'Food_drink',
'Gate_location', 'Inflght_wifi_service', 'Inflght_entrtnmnt',
'Online_support', 'Ease_of_Online_bkng', 'Onboard_service',
'Legroom_service', 'Baggage_handling', 'Checkin_service', 'Cleanliness',
'Online_boarding', 'Satisfaction']
for col in to_be_int:
X_test[col] = X_test[col].astype('int64')
data_survey=X_train[['Seat_comfort', 'Dep_Arriv_time_convenient',
'Food_drink', 'Gate_location', 'Inflght_wifi_service',
'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
'Onboard_service', 'Legroom_service', 'Baggage_handling',
'Checkin_service', 'Cleanliness', 'Online_boarding', 'Satisfaction']]
X_train.Satisfaction.describe()
plt.figure(figsize=(6, 5))
plt.hist(X_train.Satisfaction.values, bins=2)
plt.title('Histogram of target counts')
plt.xlabel('Satisfaction')
plt.ylabel('Count')
plt.show()
Observation
plt.figure(figsize=(15, 8))
sns.heatmap(X_train.corr(), annot=True,fmt='.1g')
Observation
plt.figure(figsize=(16, 5))
cols = X_train.columns
uniques = [len(X_train[col].unique()) for col in cols]
sns.set(font_scale=1.1)
ax = sns.barplot(cols, uniques, palette='hls', log=True)
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique per feature')
for p, uniq in zip(ax.patches, uniques):
height = p.get_height()
ax.text(p.get_x()+p.get_width()/2.,
height + 10,
uniq,
ha="center")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()
pd.pivot_table(data_survey, index = 'Satisfaction', values= ['Seat_comfort', 'Dep_Arriv_time_convenient',
'Food_drink', 'Gate_location', 'Inflght_wifi_service',
'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
'Onboard_service', 'Legroom_service', 'Baggage_handling',
'Checkin_service', 'Cleanliness', 'Online_boarding'])
Observation
In-flight entertainment received the highest average rating among satisfied customers, suggesting it is a strong driver of satisfaction.
table = pd.pivot_table(data=X_train,index=['Travel_Class'])
table
Observation
data_survey[['Seat_comfort', 'Dep_Arriv_time_convenient', 'Food_drink',
'Gate_location', 'Inflght_wifi_service', 'Inflght_entrtnmnt',
'Online_support', 'Ease_of_Online_bkng', 'Onboard_service',
'Legroom_service', 'Baggage_handling', 'Checkin_service', 'Cleanliness',
'Online_boarding', 'Satisfaction']].sum()
#Lets visualise this :
# Created a dataframe to figure out the highest Total Ratings.
total_survey=pd.DataFrame(data_survey.sum(), columns= ['Total'])
my_colors = 'ccccccccccccccg'
total_survey.sort_values(['Total']).plot(kind='bar',figsize=(15,5),color=my_colors)
plt.show()
Observation
# Creating a Dataset with only 5 Star Ratings:
colmns=data_survey.columns
Rating5=data_survey[data_survey[colmns] == 5]
# Lets look at the value Counts for Ratings : 5 for all Survey variables:
colmns = Rating5.columns
for col in colmns:
print('Value Counts of {} are \n'.format(col),Rating5[col].value_counts())
print('*'*90)
Observation
# Creating a dataset with only 0 ratings:
colmns=data_survey.columns
Rating0=data_survey[data_survey[colmns] == 0]
# Lets look at the value count for 0 Rating:
colmns = Rating0.columns
for col in colmns:
print('Value Counts of {} are \n'.format(col),Rating0[col].value_counts())
print('*'*90)
Observation
plt.figure(figsize=(15,5))
sns.pointplot(x="Travel_Class", y="Age", hue = 'Gender', data=X_train)
plt.show()
Observation
freq_table2 = X_train["Baggage_handling"].value_counts().to_frame()
freq_table2.reset_index(inplace=True) # reset index
freq_table2.columns = [ "Baggage_handling" , "Cnt_Baggage_handling"] # rename columns
freq_table2["Pecentage"] = freq_table2["Cnt_Baggage_handling"] / sum(freq_table2["Cnt_Baggage_handling"])
freq_table2
# Python Pie Chart code with formatting
plt.figure(figsize=(7,7))
colors = ['dodgerblue', 'blue', 'navy','cornflowerblue','powderblue','silver']
#sns.set(palette="Paired")
# Plot
plt.pie(freq_table2['Cnt_Baggage_handling'],
labels=freq_table2['Baggage_handling'],
colors=colors,
autopct='%1.1f%%',
shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Observation
f,ax=plt.subplots(1,2,figsize=(10,5))
sns.set(palette="muted")
#colors = ['dodgerblue', 'blue', 'navy','cornflowerblue','powderblue','silver']
data1[['Travel_Type','Satisfaction']].groupby(['Travel_Type']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Satisfied vs Travel Type')
sns.countplot('Travel_Type',hue='Satisfaction',data=X_train,ax=ax[1])
ax[1].set_title('Travel Type:Dissatisfied vs Satisfied')
plt.show()
Observation
print("The average Customer Departure Delay is {:.2f} minutes, 50% of Records show {:.2f} minutes of Departure Delay or less, while the maximum Delay in Departure is {:.2f} mins."
.format(X_train['DeprtDelayin_Mins_log'].mean(),X_train['DeprtDelayin_Mins_log'].quantile(0.50), X_train['DeprtDelayin_Mins_log'].max()))
print("The average Customer Arrival Delay is {:.2f} minutes, 50% of Records show {:.2f} minutes of Arrival Delay or less,while the maximum Delay in Arrival is {:.2f} mins.".format(X_train['ArrivDelayin_Mins_log'].mean(),X_train['ArrivDelayin_Mins_log'].quantile(0.50), X_train['ArrivDelayin_Mins_log'].max()))
print("The average Rating for Food n Drink is {:.0f} stars, 50% of Records show {:.0f} star Rating or lower for Food n Drink , while the maximum Rating is {:.0f} Stars.".format(X_train['Food_drink'].mean(),X_train['Food_drink'].quantile(0.50), X_train['Food_drink'].max()))
print("The average Rating for Onboard_service is {:.0f} stars, 50% of Records show {:.0f} star Rating or lower for Onboard_service , while the maximum Rating is {:.0f} Stars.".format(X_train['Onboard_service'].mean(),X_train['Onboard_service'].quantile(0.50), X_train['Onboard_service'].max()))
mydata=pd.read_excel('survey_percentage.xlsx')
mydata.head()
def pareto_plot(df, x=None, y=None, title=None, show_pct_y=False, pct_format='{0:.0%}'):
xlabel = x
ylabel = y
tmp = df.sort_values(y, ascending=False)
x = tmp[x].values
y = tmp[y].values
weights = y / y.sum()
cumsum = weights.cumsum()
fig, ax1 = plt.subplots(figsize=(25,10))
ax1.bar(x, y)
ax1.set_xlabel(xlabel,size=20)
ax1.set_ylabel(ylabel,size=20)
ax2 = ax1.twinx()
ax2.plot(x, cumsum, '-ro', alpha=0.5)
ax2.set_ylabel('', color='r',size=20)
ax2.tick_params('y', colors='r',size=20)
vals = ax2.get_yticks()
ax2.set_yticklabels(['{:,.2%}'.format(x) for x in vals])
# hide y-labels on right side
if not show_pct_y:
ax2.set_yticks([])
formatted_weights = [pct_format.format(x) for x in cumsum]
for i, txt in enumerate(formatted_weights):
ax2.annotate(txt, (x[i], cumsum[i]), fontweight='bold',size=24)
if title:
plt.title(title)
plt.tight_layout()
plt.show()
pareto_plot(mydata, x='Index', y='Satisfaction_1', title='Rating Trend')
Observation
#Creating Subset with only Satisfaction : 1
Satisfaction1 = X_train.loc[X_train.Satisfaction == 1]
Satisfaction1.groupby(["Age"])['Satisfaction'].sum().reset_index().T
Observation
Age_bin= X_train.copy()
Age_bin['Age_bin'] = pd.cut(
Age_bin['Age'], [-np.inf, 11, 31, 51, np.inf],
labels = ["Under 10", "upto 30", "31 to 50'", "Above 50"]
)
Age_bin.drop(['Age'], axis=1, inplace=True)
Age_bin['Age_bin'].value_counts(dropna=False)
tab1 = pd.crosstab(Age_bin.Satisfaction,Age_bin.Age_bin,margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(Age_bin.Satisfaction,Age_bin.Age_bin,normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(8,6))
plt.legend(loc="upper left", bbox_to_anchor=(1,1));
Observation
Agegroup = Satisfaction1.groupby(by=['Age'], as_index=False)['Satisfaction'].count()
plt.subplots(figsize=(15,6))
plt.plot(Agegroup.Age, Agegroup.Satisfaction)
plt.xlabel('Customer Age')
plt.ylabel('Satisfaction count')
plt.title('Satisfaction Rate by different Age Group')
plt.show()
Observation
print('The TOP 5 Services most likely to sway the Customers towards Satisfaction...')
mydata.sort_values(by='Satisfaction_1', ascending=False).head()
data_survey.apply(np.min)
Observation
X_train.groupby(["Travel_Class"])["Satisfaction"].agg([np.mean]).sort_values(by="mean", ascending=False).T
Observation
sns.relplot(x="Age", y="Flight_Distance", hue="Satisfaction",
col="Travel_Class", data=X_train).set_xticklabels(rotation=30);
Observation
sns.relplot(x="Flight_Distance", y="ArrivDelayin_Mins_log", hue="Satisfaction",
col="Travel_Class", data=X_train).set_xticklabels(rotation=30);
Observation
X_train=pd.get_dummies(X_train)
X_test=pd.get_dummies(X_test)
print(X_train.shape, X_test.shape)
X_train.head(2)
X_test.head(2)
X_train.columns
X_test.columns
Observation
X_train.drop('Satisfaction', axis=1, inplace=True)
X_test.drop('Satisfaction', axis=1, inplace=True)
X_train.shape
X_test.shape
True Positives:
Reality: the customer is satisfied. Model prediction: satisfied. Outcome: the model performs as intended.
True Negatives:
Reality: the customer is dissatisfied. Model prediction: dissatisfied. Outcome: the business is unaffected.
False Positives:
Reality: the customer is dissatisfied. Model prediction: satisfied. Outcome: the team targeting potential customers wastes resources on customers who will not contribute to revenue.
False Negatives:
Reality: the customer is satisfied. Model prediction: dissatisfied. Outcome: a potential customer is missed by the sales/marketing team, which hurts the business.
In this case, failing to identify a potential customer is the biggest loss we can face. Hence, RECALL is the right metric for evaluating model performance.
Satisfied Customer (Class: 1)
Dissatisfied / Neutral Customer (Class: 0)
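To make the choice of metric concrete, here is a minimal sketch on made-up labels (hypothetical, not from the Falcon dataset), using the same class convention as above:

```python
from sklearn.metrics import recall_score, precision_score

# Hypothetical ground truth and predictions: 1 = Satisfied, 0 = Dissatisfied/Neutral
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# Recall = TP / (TP + FN): of all truly satisfied customers,
# how many did the model correctly identify? Here 3 of 4 -> 0.75
print(recall_score(y_true, y_pred))

# Precision = TP / (TP + FP): of all customers predicted satisfied,
# how many actually were? Here 3 of 4 -> 0.75
print(precision_score(y_true, y_pred))
```

A false negative (the satisfied customer predicted as dissatisfied) lowers recall, which is exactly the error we most want to avoid here.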
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, train, test, train_y, test_y, flag=True):
    '''
    model : classifier to predict values of X
    '''
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(train)
    pred_test = model.predict(test)
    train_acc = model.score(train, train_y)
    test_acc = model.score(test, test_y)
    train_recall = metrics.recall_score(train_y, pred_train)
    test_recall = metrics.recall_score(test_y, pred_test)
    train_precision = metrics.precision_score(train_y, pred_train)
    test_precision = metrics.precision_score(test_y, pred_test)
    score_list.extend((train_acc, test_acc, train_recall, test_recall, train_precision, test_precision))
    # The following scores are printed only when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
def make_confusion_matrix(y_actual, model, labels=[1, 0]):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[1, 0])
    df_cm = pd.DataFrame(cm, index=["Satisfied", "Dissatisfied"],
                         columns=["Satisfied", "Dissatisfied"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
X_train.dtypes
from statsmodels.stats.outliers_influence import variance_inflation_factor
# dataframe with numerical column only
num_feature_set = X_train.select_dtypes(include=['int64','float64'])
from statsmodels.tools.tools import add_constant
num_feature_set = add_constant(num_feature_set)
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set.values,i) for i in range(num_feature_set.shape[1])],index=num_feature_set.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))
We do not see any high VIF scores; hence, there is no visible multicollinearity.
lr = LogisticRegression(random_state=1)
lr.fit(X_train,y_train)
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_bfr=cross_val_score(estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_bfr)
plt.show()
Observation:
#Calculating different metrics
scores_LR = get_metrics_score(lr,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(y_test,lr)
Observation on Scores
lr.predict_proba(X_test)
#AUC ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:,1])
plt.figure(figsize=(10,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
Observation
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:,1])
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print(optimal_threshold)
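As a small sanity check on the cut-off rule used above (pick the threshold where tpr - fpr, i.e. Youden's J statistic, is largest), here is a self-contained sketch on synthetic scores; the labels and probabilities are made up for illustration only:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical ground truth and predicted probabilities
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J statistic: choose the threshold maximizing tpr - fpr
j = tpr - fpr
optimal_threshold = thresholds[np.argmax(j)]
print(optimal_threshold)  # 0.6: three positives captured with zero false positives
```

At 0.6 the classifier captures three of the four positives without admitting any negative, which is the best trade-off available on this toy curve.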
Observation
def make_confusion_matrix(y_actual, y_predict, labels=[1, 0]):
    '''
    y_actual : ground truth
    y_predict : predicted class labels
    '''
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[1, 0])
    df_cm = pd.DataFrame(cm, index=["Satisfied", "Dissatisfied"],
                         columns=["Satisfied", "Dissatisfied"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
target_names = ['Dissatisfied', 'Satisfied']
y_pred_tr = (lr.predict_proba(X_train)[:,1]>optimal_threshold).astype(int)
y_pred_ts = (lr.predict_proba(X_test)[:,1]>optimal_threshold).astype(int)
make_confusion_matrix(y_test,y_pred_ts)
Observation
print("Accuracy on training set : ",metrics.accuracy_score(y_train,y_pred_tr))
print("Accuracy on test set : ",metrics.accuracy_score(y_test,y_pred_ts))
print("Recall on training set : ",metrics.recall_score(y_train,y_pred_tr))
print("Recall on test set : ",metrics.recall_score(y_test,y_pred_ts))
print("Precision on training set : ",metrics.precision_score(y_train,y_pred_tr))
print("Precision on test set : ",metrics.precision_score(y_test,y_pred_ts))
Observation
y_proba=lr.predict_proba(X_test)[:,1]
from sklearn.metrics import roc_curve, precision_recall_curve
def threshold_search(y_test, y_proba, plot=False):
    precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
    thresholds = np.append(thresholds, 1.001)
    F = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall, i.e. F1
    best_score = np.max(F)
    best_th = thresholds[np.argmax(F)]
    if plot:
        plt.plot(thresholds, F, '-b')
        plt.plot([best_th], [best_score], '*r')
        plt.show()
    search_result = {'threshold': best_th, 'f1': best_score}
    return search_result
threshold_search(y_test,y_proba)
threshold = 0.5127350164199415  # best F1 threshold returned by threshold_search above
target_names = ['Dissatisfied', 'Satisfied']
y_pred_tr = (lr.predict_proba(X_train)[:,1]>=threshold).astype(int)
y_pred_ts = (lr.predict_proba(X_test)[:,1]>=threshold).astype(int)
make_confusion_matrix(y_test,y_pred_ts)
print("Accuracy on training set : ",metrics.accuracy_score(y_train,y_pred_tr))
print("Accuracy on test set : ",metrics.accuracy_score(y_test,y_pred_ts))
print("Recall on training set : ",metrics.recall_score(y_train,y_pred_tr))
print("Recall on test set : ",metrics.recall_score(y_test,y_pred_ts))
print("Precision on training set : ",metrics.precision_score(y_train,y_pred_tr))
print("Precision on test set : ",metrics.precision_score(y_test,y_pred_ts))
Observation
Since the original logistic regression model with the optimal threshold gave us the best RECALL score, let's interpret the coefficients for that model:
log_odds = lr.coef_[0]
pd.DataFrame(log_odds, X_train.columns, columns=['coef']).T
odds = np.exp(lr.coef_[0]) - 1  # fractional change in odds per one-unit increase in the feature
pd.set_option('display.max_rows',None)
pd.DataFrame(odds, X_train.columns, columns=['odds']).T
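To make the odds table concrete, here is a hedged worked example using a coefficient of roughly 0.82, the value seen for Inflght_entrtnmnt in the coefficient data below (the exact figure is illustrative):

```python
import numpy as np

coef = 0.82  # illustrative logistic-regression coefficient

# exp(coef) is the odds ratio: the multiplicative change in the odds of
# satisfaction for a one-unit increase in the feature, all else equal.
odds_ratio = np.exp(coef)
print(round(odds_ratio, 2))  # 2.27

# exp(coef) - 1 expresses the same quantity as a fractional change:
# about a 127% increase in the odds of satisfaction.
pct_change = np.exp(coef) - 1
print(round(pct_change, 2))  # 1.27
```

Negative coefficients work symmetrically: exp(coef) below 1 means each unit increase shrinks the odds of satisfaction.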
data = np.array([-0.02,-0.00, 0.21, -0.097 , -0.19, -0.08, -0.23, 0.82, 0.08, 0.58, 0.25, 0.13, -0.10, 0.07 ,-0.13, 0.02, -0.00, -0.15, 0.14, -0.37, 0.27, -0.43, 0.27, -0.38, -0.07])
labels = ['Age', 'Flight_Distance','Seat_comfort','Dep_Arriv_time_convenient','Food_drink','Gate_location','Inflght_wifi_service','Inflght_entrtnmnt','Online_support','Ease_of_Online_bkng','Onboard_service','Legroom_service','Baggage_handling','Checkin_service','Cleanliness','Online_boarding','DeprtDelayin_Mins_log','ArrivDelayin_Mins_log','Gender_Female','Gender_Male','Travel_Type_Business travel','Travel_Type_Personal Travel','Travel_Class_Business','Travel_Class_Eco','Travel_Class_Eco Plus']
import matplotlib.pylab as pl
pl.figure(figsize=(30, 5))
ax=pl.subplot(122)
pl.bar(np.arange(data.size), data)
ax.set_xticks(np.arange(data.size))
ax.set_xticklabels(labels)
ax.set_xticklabels(labels, rotation = 45, ha="right")
models = [] # Empty list to store all the models
# Appending pipelines for each model into the list
models.append(
(
"RF",
Pipeline(
steps=[
("scaler", StandardScaler()),
("random_forest", RandomForestClassifier(random_state=1)),
]
),
)
)
models.append(
(
"GBM",
Pipeline(
steps=[
("scaler", StandardScaler()),
("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"BG",
Pipeline(
steps=[
("scaler", StandardScaler()),
("bagging", BaggingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"ADB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("adaboost", AdaBoostClassifier(random_state=1)),
]
),
)
)
models.append(
(
"XGB",
Pipeline(
steps=[
("scaler", StandardScaler()),
("xgboost", XGBClassifier(random_state=1,eval_metric='logloss')),
]
),
)
)
models.append(
(
"DTREE",
Pipeline(
steps=[
("scaler", StandardScaler()),
("decision_tree", DecisionTreeClassifier(random_state=1)),
]
),
)
)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross-validated score
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Observation
We can see that XGB (XG-Boost) has given the highest CV Score of 94.4%.
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
    """
    model : classifier to predict values of X
    """
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
        )
    )
    # The following scores are printed only when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
def make_confusion_matrix(y_actual, model, labels=[1, 0]):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[1, 0])
    df_cm = pd.DataFrame(cm, index=["Satisfied", "Dissatisfied"],
                         columns=["Satisfied", "Dissatisfied"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(class_weight={0:0.45,1:0.55},random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"decisiontreeclassifier__criterion": ['gini','entropy'],
"decisiontreeclassifier__max_depth": [3, 4, 5, None],
"decisiontreeclassifier__min_samples_split": [2,4,7,10,15]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
# Creating new pipeline with best parameters
dtree_tuned1 = make_pipeline(
StandardScaler(),
DecisionTreeClassifier(random_state=1, criterion='entropy', max_depth=5, min_samples_split=2),
)
# Fit the model on training data
dtree_tuned1.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(dtree_tuned1)
# Creating confusion matrix
make_confusion_matrix(y_test,dtree_tuned1)
Observations:
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(class_weight={0:0.45,1:0.55},random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"decisiontreeclassifier__criterion": ['gini','entropy'],
"decisiontreeclassifier__max_depth": [3, 4, 5, None],
"decisiontreeclassifier__min_samples_split": [2,4,7,10,15]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=20, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Creating new pipeline with best parameters
dtree_tuned2 = make_pipeline(
StandardScaler(),
DecisionTreeClassifier(random_state=1, criterion='entropy', max_depth=5, min_samples_split=4),
)
# Fit the model on training data
dtree_tuned2.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(dtree_tuned2)
# Creating confusion matrix
make_confusion_matrix(y_test,dtree_tuned2)
Observations:
# Creating pipeline
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(class_weight={0:0.45,1:0.55},random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"randomforestclassifier__n_estimators": [100],
"randomforestclassifier__bootstrap": [True],
"randomforestclassifier__max_depth": [3, 5, 7],
"randomforestclassifier__max_features": ['auto', 'sqrt','log2'],
"randomforestclassifier__min_samples_leaf": [2, 3, 5],
"randomforestclassifier__min_samples_split": [3, 5, 7]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=3)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
# Creating new pipeline with best parameters
RF_tuned1 = make_pipeline(
StandardScaler(),
RandomForestClassifier(random_state=1,class_weight={0:0.45,1:0.55},max_features='auto',n_estimators=100,min_samples_leaf=2,bootstrap=True, max_depth=7, min_samples_split=7),
)
# Fit the model on training data
RF_tuned1.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(RF_tuned1)
# Creating confusion matrix
make_confusion_matrix(y_test,RF_tuned1)
Observation
# Creating pipeline
pipe = make_pipeline(StandardScaler(),RandomForestClassifier(class_weight={0:0.45,1:0.55},random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"randomforestclassifier__n_estimators": [100],
"randomforestclassifier__bootstrap": [True],
"randomforestclassifier__max_depth": [3, 5, 7],
"randomforestclassifier__max_features": ['auto', 'sqrt','log2'],
"randomforestclassifier__min_samples_leaf": [2, 3, 5],
"randomforestclassifier__min_samples_split": [3, 5, 7]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=20, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Creating new pipeline with best parameters
RF_tuned2 = make_pipeline(
StandardScaler(),
RandomForestClassifier(random_state=1,class_weight={0:0.45,1:0.55},bootstrap=True,n_estimators=100,min_samples_leaf=5,max_features='sqrt',max_depth=7, min_samples_split=3),
)
# Fit the model on training data
RF_tuned2.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(RF_tuned2)
# Creating confusion matrix
make_confusion_matrix( y_test, RF_tuned2)
Observation
# Creating pipeline
pipe = make_pipeline(
StandardScaler(), AdaBoostClassifier(random_state=1)
)
# Parameter grid to pass in GridSearchCV
param_grid = {
"adaboostclassifier__base_estimator":[DecisionTreeClassifier(max_depth=4)],
"adaboostclassifier__n_estimators": np.arange(10,110,10),
"adaboostclassifier__learning_rate":np.arange(0.1,1,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=3)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
grid_cv.best_params_, grid_cv.best_score_
)
)
# Creating new pipeline with best parameters
Adb_tuned1 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(random_state=1,n_estimators=70,learning_rate=0.6,base_estimator=DecisionTreeClassifier(max_depth=4)),
)
# Fit the model on training data
Adb_tuned1.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(Adb_tuned1)
# Creating confusion matrix
make_confusion_matrix(y_test,Adb_tuned1)
Observation
# Creating pipeline
pipe = make_pipeline(StandardScaler(),AdaBoostClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"adaboostclassifier__base_estimator":[DecisionTreeClassifier(max_depth=4)],
"adaboostclassifier__n_estimators": np.arange(10,110,10),
"adaboostclassifier__learning_rate":np.arange(0.1,1,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=10, scoring=scorer, cv=3, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Creating new pipeline with best parameters
Adb_tuned2 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(random_state=1,n_estimators=100,learning_rate=0.4,base_estimator=DecisionTreeClassifier(max_depth=4)),
)
# Fit the model on training data
Adb_tuned2.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(Adb_tuned2)
# Creating confusion matrix
make_confusion_matrix(y_test, Adb_tuned2)
Observation
# Creating pipeline
pipe = make_pipeline(
StandardScaler(), GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
)
# Parameter grid to pass in GridSearchCV
param_grid = {
"gradientboostingclassifier__n_estimators": [100,150],
"gradientboostingclassifier__subsample":[0.8,0.9,1],
"gradientboostingclassifier__max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=3)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
grid_cv.best_params_, grid_cv.best_score_
)
)
# Creating new pipeline with best parameters
Gmb_tuned1 = make_pipeline(
StandardScaler(),GradientBoostingClassifier(random_state=1,n_estimators=150,max_features=0.8,subsample=0.8),
)
# Fit the model on training data
Gmb_tuned1.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(Gmb_tuned1)
# Creating confusion matrix
make_confusion_matrix(y_test,Gmb_tuned1)
Observation
# Creating pipeline
pipe = make_pipeline(
StandardScaler(), GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
)
# Parameter grid to pass in GridSearchCV
param_grid = {
"gradientboostingclassifier__n_estimators": [100,150],
"gradientboostingclassifier__subsample":[0.8,0.9,1],
"gradientboostingclassifier__max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=10, scoring=scorer, cv=3, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Creating new pipeline with best parameters
Gmb_tuned2 = make_pipeline(
StandardScaler(),GradientBoostingClassifier(random_state=1,n_estimators=150,max_features=0.8,subsample=0.9),
)
# Fit the model on training data
Gmb_tuned2.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(Gmb_tuned2)
# Creating confusion matrix
make_confusion_matrix(y_test,Gmb_tuned2)
Observation
# Creating pipeline
pipe = make_pipeline(
StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss")
)
# Parameter grid to pass in GridSearchCV
param_grid = {
"xgbclassifier__n_estimators": np.arange(50, 100, 50),
"xgbclassifier__scale_pos_weight": [8],
"xgbclassifier__learning_rate": [0.01, 0.1, 0.2, 0.05],
"xgbclassifier__gamma": [0, 1, 3, 5],
"xgbclassifier__subsample": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=3)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
grid_cv.best_params_, grid_cv.best_score_
)
)
# Creating new pipeline with best parameters
xgb_tuned1 = make_pipeline(
StandardScaler(),
XGBClassifier(
random_state=1,
eval_metric="logloss",
n_estimators=50,
scale_pos_weight=8,
subsample=1,
learning_rate=0.2,
gamma=5,
),
)
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(xgb_tuned1)
# Creating confusion matrix
make_confusion_matrix(y_test,xgb_tuned1)
Observation - XGBoost with GridSearchCV gave us excellent accuracy.
#Creating pipeline
pipe=make_pipeline(StandardScaler(),XGBClassifier(random_state=1,eval_metric="logloss"))
#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,100,50),'xgbclassifier__scale_pos_weight':[8],
'xgbclassifier__learning_rate':[0.01,0.1,0.2,0.05], 'xgbclassifier__gamma':[0,1,3,5],
'xgbclassifier__subsample':[0.7,0.8,0.9,1]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=10, scoring=scorer, cv=3, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
# Creating new pipeline with best parameters
xgb_tuned2 = Pipeline(
steps=[
("scaler", StandardScaler()),
(
"XGB",
XGBClassifier(
random_state=1,
eval_metric="logloss",
n_estimators=50,
scale_pos_weight=8,
learning_rate=0.2,
gamma=1,
subsample=1,
),
),
]
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
# Calculating different metrics
get_metrics_score(xgb_tuned2)
# Creating confusion matrix
make_confusion_matrix( y_test,xgb_tuned2)
Observation
import pandas as pd
comparison_frame = pd.DataFrame({
    'Model': ['Initial Logistic Regression Model with sklearn',
              'Increased optimal threshold - Logistic Regression Model with sklearn',
              'Best optimal threshold - Logistic Regression Model with sklearn'],
    'Train_Accuracy': [0.79, 0.79, 0.79],
    'Test_Accuracy': [0.79, 0.79, 0.79],
    'Train_Recall': [0.84, 0.78, 0.84],
    'Test_Recall': [0.84, 0.78, 0.83],
    'Train_Precision': [0.79, 0.83, 0.79],
    'Test_Precision': [0.79, 0.82, 0.79],
})
comparison_frame
Observation
# defining list of models
models = [dtree_tuned1, dtree_tuned2, RF_tuned1, RF_tuned2, Adb_tuned1, Adb_tuned2, Gmb_tuned1, Gmb_tuned2, xgb_tuned1, xgb_tuned2]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
comparison_frame = pd.DataFrame(
{
"Model": [
"Decision-Tree-GridSearchCV",
"Decision-Tree-RandomSearchCV",
"Random-Forest-GridSearchCV",
"Random-Forest-RandomSearchCV",
"ADA-Boost-GridSearchCV",
"ADA-Boost-RandomSearchCV",
"GMB-GridSearchCV",
"GMB-RandomSearchCV",
"XG-Boost-GridSearchCV",
"XG-Boost-RandomSearchCV",
],
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(by="Test_Recall", ascending=False)
Observation
feature_names = X_train.columns
importances = xgb_tuned2[1].feature_importances_
indices = np.argsort(importances)
my_colors = list('gggyyyyyybbbbbbggggggg')  # one single-character colour code per bar
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color=my_colors, align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Insights From the Best Model
While feature importance shows what variables most affect predictions, partial dependence plots show how a feature affects predictions.
from pdpbox import pdp, get_dataset, info_plots
def plot_pdp(model, df, feature, cluster_flag=False, nb_clusters=None, lines_flag=False):
    # Create the data that we will plot, using the model and dataframe passed in
    pdp_goals = pdp.pdp_isolate(model=model, dataset=df, model_features=df.columns.tolist(), feature=feature)
    # plot it
    pdp.pdp_plot(pdp_goals, feature, cluster=cluster_flag, n_cluster_centers=nb_clusters, plot_lines=lines_flag)
    plt.show()
# plot the PD univariate plot
plot_pdp(xgb_tuned2, X_train, 'Inflght_entrtnmnt')
Observation
This PD plot shows that the In-Flight Entertainment service has an increasingly positive impact on the target variable, Satisfaction, for ratings between 3 and 5.
# plot the PD univariate plot
plot_pdp(xgb_tuned2, X_train, 'Travel_Type_Business travel')
Observation
# plot the PD univariate plot
plot_pdp(xgb_tuned2, X_train, 'Gender_Female')
Observation
# plot the PD univariate plot
plot_pdp(xgb_tuned2, X_train, 'Gender_Male')
Observation
We analyzed the Falcon Airlines customer data using logistic regression, ensemble methods such as bagging and boosting, and hyperparameter-tuned models, in order to build a model that predicts whether a customer is likely to be satisfied or not.
All of the models were evaluated against our chosen scoring metric, RECALL.
We compared the models to determine which one gave the best possible recall score, our main evaluation criterion.
We built the models keeping the evaluation criterion, the true positive rate, in mind, trying to improve the proportion of correctly identified satisfied customers. Hence, the model with the best recall that also generalizes well was chosen as the final model.
With the logistic regression analysis, the model with the optimal threshold gave a generalized performance with a test recall of 84%.
There was still scope for improvement, so we went ahead with further analysis and tried different models.
The hyperparameter-tuned XGBoost with RandomizedSearchCV gave us an extremely well-generalized model with a recall of 99% and no overfitting, so it can be considered the best model of all and should also generalize well.
With the tuned XGBoost (RandomizedSearchCV), the features deemed most significant are In-Flight Entertainment, Seat Comfort, Gender: Female, Ease of Online Booking, and On-Board Service.
The Air transportation industry is becoming leaner, quicker, tech-enabled, and data-driven.
Machine learning methods have transformed our ability to improve interactions for customers (who look for information, solutions, or resolution through personalized experiences) and organizations (who look to provide those personalized experiences across multiple channels in the most cost-efficient manner).
Introducing features such as "well-being on board", where passengers receive practical tips and tricks for their in-flight well-being in a sporty video with celebrity support, would enhance the flight experience. Airlines could work closely with their customers to continuously develop and install new features and elements in the cabin in order to improve the overall flight experience. The airline can thus map out how each customer journey can be re-designed to reflect the values driving its purpose, and sway more customers towards satisfaction.